Proteomics-Based Comparative Mapping of the Secretomes of Human Brown and White Adipocytes Reveals EPDR1 as a Novel Batokine

Adipokines secreted from white adipose tissue play a role in metabolic crosstalk and homeostasis, whereas the brown adipose secretome is less explored. We performed high-sensitivity mass-spec- trometry-based proteomics on the cell media of human adipocytes derived from the supraclavicular brown ad- ipose and from the subcutaneous white adipose de- pots of adult humans. We identified 471 potentially secreted proteins covering interesting categories such as hormones, growth factors, extracellular matrix proteins, and proteins of the complement system, which were differentially regulated between brown and white adipocytes. A total of 101 proteins were exclusively quantified in brown adipocytes, and among these was ependymin-related protein 1 (EPDR1). EPDR1 was detected in human plasma, and functional studies suggested a role for EPDR1 in thermogenic determination during adipogenesis. In conclusion, we report substantial differences between the secretomes of brown and white human adipocytes and identify novel candidate batokines that can be important regu- lators of human metabolism. REF: https://www.cell.com/cell-metabolism/fulltext/S1550-4131(19)30556-X

|graphical\_abstract.png|

[1]:
import os
import re
import pandas as pd
import numpy as np

import ckg.ckg_utils as ckg_utils
from ckg.graphdb_builder import builder_utils, mapping
from ckg.graphdb_builder.experiments.parsers import proteomicsParser
from ckg.graphdb_connector import connector

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.report_manager import knowledge

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.10, the latest is 0.3.11.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package outdated is out of date. Your version is 0.2.0, the latest is 0.2.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
[2]:
analysis_dir = '../../../../data/tmp/Deshmukh2019'
ckg_utils.checkDirectory(analysis_dir)

pxd_id = 'PXD008541'
file_name='SearchEngineResults_secretome.zip.rar'

Download Data from PRIDE

We can use functionality in graphdb_builder to directly download data files from EBI’s PRIDE database (https://www.ebi.ac.uk/pride/). For that you you just need to specify the PRIDE identifier for the project (PXD…) and the name of the file to download. In this case, the project identifier is PXD008541 and the file we will use is SearchEngineResults_secretome.zip.rar, a RAR compressed file.

[3]:
builder_utils.download_PRIDE_data(pxd_id=pxd_id,
                                  file_name=file_name,
                                  to=analysis_dir)
'publicationDate'
[3]:
{'status': 'INTERNAL_SERVER_ERROR',
 'code': 500,
 'message': 'Internal Server Error: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.QueryTimeoutException: Could not open connection',
 'developerMessage': 'Please report to pride-support@ebi.ac.uk',
 'moreInfoUrl': None,
 'throwable': None}

Read Data In

Decompress File

[4]:
builder_utils.unrar(filepath=os.path.join(analysis_dir, file_name), to=analysis_dir)
Error: ../../../../data/tmp/Deshmukh2019\SearchEngineResults_secretome.zip.rar. Could not unrar file Unrar not installed? (rarfile.UNRAR_TOOL='unrar')

The list of files within the compressed folder can be listed using the listDirectoryFiles functionality in gaphdb_builder.

[5]:
builder_utils.listDirectoryFiles(analysis_dir)
[5]:
['anova_results.tsv',
 'BAT_woNE_vs_WAT_woNE_volcano_plot.svg',
 'missing_values_per_sample.svg',
 'original_data_boxplot.svg',
 'original_data_distplot.svg',
 'original_data_parallel_plot_upregulated_proteins.svg',
 'pca.svg',
 'processed_data_boxplot.svg',
 'processed_data_distplot.svg',
 'processed_data_matrix.tsv',
 'processed_data_parallel_plot_upregulated_proteins.svg',
 'SearchEngineResults_secretome.zip.rar',
 'secreted_processed_data.tsv',
 'secretome.tsv']

We use the proteinGroups file that contains the proteomics data processed using MaxQuant software.

[6]:
proteinGroups_file = os.path.join(os.path.join(analysis_dir, "SearchEngineResults_secretome"), 'proteinGroups.txt')

Parse Contents

CKG has parsers for MaxQuant and Spectronaut output files. The default configuration needed to parse these files needs to be updated with the name of the columns containing the protein quantifications for each sample. Also, the default configuration can be adapted to the experiment by selected specific filters or removing non-used columns. For example, in this study the output file did not have columns: Score, Q-value, so we removed them from the configuration and the column ‘Potential contaminant’ was renamed to ‘Contaminant’ so we changed the name in the filters.

[7]:
#d = pd.read_csv(proteinGroups_file, sep='\t')
#d.columns.tolist()
[8]:
columns = ['LFQ intensity BAT_NE1',
           'LFQ intensity BAT_NE2',
           'LFQ intensity BAT_NE3',
           'LFQ intensity BAT_NE4',
           'LFQ intensity BAT_NE5',
           'LFQ intensity BAT_woNE1',
           'LFQ intensity BAT_woNE2',
           'LFQ intensity BAT_woNE3',
           'LFQ intensity BAT_woNE4',
           'LFQ intensity BAT_woNE5',
           'LFQ intensity WAT_NE1',
           'LFQ intensity WAT_NE2',
           'LFQ intensity WAT_NE3',
           'LFQ intensity WAT_NE4',
           'LFQ intensity WAT_NE5',
           'LFQ intensity WAT_woNE1',
           'LFQ intensity WAT_woNE2',
           'LFQ intensity WAT_woNE3',
           'LFQ intensity WAT_woNE4',
           'LFQ intensity WAT_woNE5',
           'Contaminant']
[9]:
configuration = proteomicsParser.update_configuration(data_type='proteins',
                                                      processing_tool='maxquant',
                                                      value_col='LFQ intensity',
                                                      columns=columns,
                                                      regex=r'(\w+_\w+\d+)',
                                                      drop_cols=['Score', 'Q-value', 'Potential contaminant'],
                                                      filters=['Reverse', 'Only identified by site', 'Contaminant'])
[10]:
configuration
[10]:
{'columns': ['Majority protein IDs',
  'Razor + unique peptides',
  'id',
  'LFQ intensity AS\\d+_?-?\\d*',
  'Intensity AS\\d+_?-?\\d*',
  'Reverse',
  'Only identified by site',
  'is_razor',
  'LFQ intensity BAT_NE1',
  'LFQ intensity BAT_NE2',
  'LFQ intensity BAT_NE3',
  'LFQ intensity BAT_NE4',
  'LFQ intensity BAT_NE5',
  'LFQ intensity BAT_woNE1',
  'LFQ intensity BAT_woNE2',
  'LFQ intensity BAT_woNE3',
  'LFQ intensity BAT_woNE4',
  'LFQ intensity BAT_woNE5',
  'LFQ intensity WAT_NE1',
  'LFQ intensity WAT_NE2',
  'LFQ intensity WAT_NE3',
  'LFQ intensity WAT_NE4',
  'LFQ intensity WAT_NE5',
  'LFQ intensity WAT_woNE1',
  'LFQ intensity WAT_woNE2',
  'LFQ intensity WAT_woNE3',
  'LFQ intensity WAT_woNE4',
  'LFQ intensity WAT_woNE5',
  'Contaminant'],
 'generated_columns': ['is_razor'],
 'filters': ['Reverse', 'Only identified by site', 'Contaminant'],
 'proteinCol': 'Majority protein IDs',
 'contaminant_tag': 'CON__',
 'valueCol': 'LFQ intensity',
 'groupCol': 'id',
 'indexCol': 'Majority protein IDs',
 'attributes': {'cols': ['id', 'is_razor'], 'regex': ['Intensity']},
 'log': 'log2',
 'file': 'proteinGroups.txt',
 'regex': '(\\w+_\\w+\\d+)'}

When we parse the data, we obtain a matrix in an edge list following CKG’s graph format: sample, protein, realtionship_type, value, protein_group_id, is_razor

[11]:
data = proteomicsParser.parser_from_file(proteinGroups_file, configuration=configuration, data_type='proteins')[('proteins', 'w')]
[12]:
data.head()
[12]:
START_ID END_ID TYPE value id is_razor
0 WAT_woNE5 A0AVL1 HAS_QUANTIFIED_PROTEIN 20.109423 0 False
1 BAT_NE5 A1A441 HAS_QUANTIFIED_PROTEIN 22.986122 1 False
2 BAT_woNE3 A1A441 HAS_QUANTIFIED_PROTEIN 22.753054 1 False
3 BAT_woNE5 A1A441 HAS_QUANTIFIED_PROTEIN 22.475955 1 False
4 BAT_NE3 A1A441 HAS_QUANTIFIED_PROTEIN 23.317552 1 False
[13]:
data.columns = ['sample', 'identifier', 'relationship', 'LFQ intensity', 'id', 'is_razor']
[14]:
data.head()
[14]:
sample identifier relationship LFQ intensity id is_razor
0 WAT_woNE5 A0AVL1 HAS_QUANTIFIED_PROTEIN 20.109423 0 False
1 BAT_NE5 A1A441 HAS_QUANTIFIED_PROTEIN 22.986122 1 False
2 BAT_woNE3 A1A441 HAS_QUANTIFIED_PROTEIN 22.753054 1 False
3 BAT_woNE5 A1A441 HAS_QUANTIFIED_PROTEIN 22.475955 1 False
4 BAT_NE3 A1A441 HAS_QUANTIFIED_PROTEIN 23.317552 1 False
[15]:
data.shape
[15]:
(57470, 6)
[16]:
data = data[data.is_razor]
[17]:
data.shape
[17]:
(17489, 6)

We can use the sample names to extract the group information: BAT_NE, WAT_NE, BAT_woNE, WAT_woNE

With this last column, we obtain the original dataframe used as starting point in CKG’ analysis pipelines.

[18]:
data['group'] = data['sample'].apply(lambda x: re.sub('\d', '', x))
[19]:
data.head()
[19]:
sample identifier relationship LFQ intensity id is_razor group
7 BAT_NE4 A1L4H1 HAS_QUANTIFIED_PROTEIN 21.686542 2 True BAT_NE
134 WAT_NE4 A6NCN2 HAS_QUANTIFIED_PROTEIN 27.843440 10 True WAT_NE
135 BAT_woNE2 A6NDG6 HAS_QUANTIFIED_PROTEIN 23.446827 11 True BAT_woNE
136 BAT_NE5 A6NDG6 HAS_QUANTIFIED_PROTEIN 23.047168 11 True BAT_NE
137 BAT_woNE3 A6NDG6 HAS_QUANTIFIED_PROTEIN 23.307164 11 True BAT_woNE
[20]:
original = data[['group', 'sample', 'identifier', 'LFQ intensity']]
[21]:
df = analytics.transform_proteomics_edgelist(original,
                                           index_cols=['group', 'sample'],
                                           drop_cols=['sample'],
                                           group='group',
                                           identifier='identifier',
                                           extra_identifier=None,
                                           value_col='LFQ intensity')
[22]:
df.shape
[22]:
(20, 1868)
[23]:
df.describe()
[23]:
identifier A1L4H1 A6NCN2 A6NDG6 A8K2U0 B3EWG6 B3KW70 B4DUF5 B4DUW4 D6RD97 D6REB5 ... Q9Y680 Q9Y696 Q9Y6B6 Q9Y6C2 Q9Y6C9 Q9Y6D6 Q9Y6E0 Q9Y6E2 Q9Y6G9 Q9Y6I3
count 1.000000 1.00000 8.000000 5.000000 1.000000 20.000000 1.000000 1.00000 1.000000 1.00000 ... 1.000000 14.000000 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 2.000000 7.000000
mean 21.686542 27.84344 23.357560 21.966594 22.579298 28.716408 20.063026 24.12849 22.271874 20.50174 ... 19.579739 25.661670 20.637148 22.149265 20.798991 21.932686 20.572377 21.000033 23.177596 22.972208
std NaN NaN 0.326612 0.497770 NaN 0.928132 NaN NaN NaN NaN ... NaN 0.862677 NaN NaN NaN NaN 0.105813 NaN 0.040308 0.617836
min 21.686542 27.84344 22.817377 21.346029 22.579298 27.265350 20.063026 24.12849 22.271874 20.50174 ... 19.579739 24.221919 20.637148 22.149265 20.798991 21.932686 20.497556 21.000033 23.149094 22.132167
25% 21.686542 27.84344 23.210751 21.830435 22.579298 28.257911 20.063026 24.12849 22.271874 20.50174 ... 19.579739 25.036140 20.637148 22.149265 20.798991 21.932686 20.534966 21.000033 23.163345 22.548973
50% 21.686542 27.84344 23.376995 21.896706 22.579298 28.404056 20.063026 24.12849 22.271874 20.50174 ... 19.579739 25.533329 20.637148 22.149265 20.798991 21.932686 20.572377 21.000033 23.177596 23.089712
75% 21.686542 27.84344 23.538843 22.032815 22.579298 29.141976 20.063026 24.12849 22.271874 20.50174 ... 19.579739 26.233573 20.637148 22.149265 20.798991 21.932686 20.609788 21.000033 23.191847 23.282008
max 21.686542 27.84344 23.867311 22.726982 22.579298 30.812377 20.063026 24.12849 22.271874 20.50174 ... 19.579739 26.981961 20.637148 22.149265 20.798991 21.932686 20.647198 21.000033 23.206099 23.921616

8 rows × 1866 columns

[24]:
aux = df.set_index(['sample']).T.isna().sum().reset_index().sort_values(by='sample')
plot = viz.get_barplot(aux, identifier='missing', args={'x':0,
                                                 'y':'sample',
                                                 'title':"Missing values per sample",
                                                 "orientation":'h',
                                                "x_title":'# missing values', 'y_title':'samples',
                                                 'height':900, 'width':400})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='missing_values_per_sample', plot_format='svg', directory=analysis_dir)
[25]:
aux = df.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_boxplot_grid(aux, identifier='boxplot', args={"Title":"Boxplot", 'x':'sample', 'y':'value', 'color':'group', 'axis':'cols', 'height':200})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_boxplot', plot_format='svg', directory=analysis_dir)
IN
[26]:
aux = df.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_histogram(aux, identifier='histogram', args={'x':'value', 'color':'group', 'facet_row':'sample', 'title':'Facet Grid Plot'})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_distplot', plot_format='svg', directory=analysis_dir)

Data Preparation

In order to prepare the data we follow the steps:

  1. Filtering based on missing values

  2. Imputation of missing values using a mixed model estrategy: KNN and MinProb

These steps will generate the processed dataframe, a complete matrix that can be used in the exploratory and statistical analysis.

[27]:
processed_data = analytics.get_proteomics_measurements_ready(original,
                                                             index_cols=['group', 'sample'],
                                                             drop_cols=['sample'],
                                                             group='group',
                                                             identifier='identifier',
                                                             extra_identifier=None,
                                                             imputation=True,
                                                             method='mixed',
                                                             knn_cutoff=0.4,
                                                             missing_method='at_least_x',
                                                             missing_per_group=True,
                                                             min_valid=3,
                                                             value_col='LFQ intensity',
                                                             shift=1.8,
                                                             nstd=0.3)
[28]:
processed_data.head()
[28]:
identifier group sample A6NDG6 B3KW70 E9PAV3 E9PGF5 E9PHK0 F5GWP8 F8W031 G3V3G9 ... Q9Y4Y9 Q9Y5P4 Q9Y5X3 Q9Y5Z4 Q9Y600 Q9Y617 Q9Y646 Q9Y678 Q9Y696 Q9Y6I3
0 BAT_NE BAT_NE1 22.948430 29.091339 26.578739 23.563604 27.322081 27.821164 24.420405 23.947173 ... 22.109509 23.130157 22.569201 28.647187 24.888148 26.012226 22.746304 23.610317 24.676053 23.408469
1 BAT_NE BAT_NE2 22.954981 28.610591 27.468243 23.811590 26.283462 27.452462 23.071490 23.891354 ... 22.109780 23.312664 22.572965 28.026016 24.164153 26.294796 22.171377 25.493292 24.928906 22.570449
2 BAT_NE BAT_NE3 22.817377 28.272086 26.773793 23.708936 25.299149 27.471115 22.773441 23.618293 ... 22.104088 23.378619 22.493896 27.956985 25.221587 26.440457 22.150598 25.144410 26.972790 22.340063
3 BAT_NE BAT_NE4 22.937884 30.812377 26.089441 23.757767 28.160532 28.601127 23.301111 23.881632 ... 22.109072 22.237510 22.563140 27.506049 24.793329 24.906694 22.693322 23.264133 25.210342 22.709767
4 BAT_NE BAT_NE5 23.047168 28.364277 26.971255 24.055731 25.989885 26.889135 22.317248 24.034814 ... 22.113593 23.397543 22.625936 28.196715 25.171234 26.397037 21.778724 24.573057 26.251168 22.132167

5 rows × 1169 columns

[29]:
processed_data.describe()
[29]:
identifier A6NDG6 B3KW70 E9PAV3 E9PGF5 E9PHK0 F5GWP8 F8W031 G3V3G9 H0YFH3 K7ERI9 ... Q9Y4Y9 Q9Y5P4 Q9Y5X3 Q9Y5Z4 Q9Y600 Q9Y617 Q9Y646 Q9Y678 Q9Y696 Q9Y6I3
count 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 ... 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000
mean 23.128097 28.716408 26.027184 23.259318 26.659409 28.309229 22.856406 22.577696 22.564280 23.144775 ... 22.841194 22.060772 22.595484 28.287357 24.694083 26.657958 22.913560 24.756667 25.660985 22.282274
std 0.491487 0.928132 1.136864 0.969810 0.725931 0.991160 0.747094 1.013024 1.354932 1.273632 ... 1.261612 1.114248 0.570500 0.644960 0.580041 0.830303 0.777069 0.794566 0.762354 1.024486
min 21.887487 27.265350 23.243551 20.776162 25.299149 26.639618 21.241523 21.012823 19.727160 21.080893 ... 20.098448 20.103549 21.390805 26.324229 23.570569 24.906694 21.778724 23.264133 24.221919 20.460160
25% 22.907757 28.257911 25.233104 22.853012 26.166929 27.678857 22.431878 21.862881 21.612875 22.236824 ... 22.107826 21.197256 22.329569 28.087609 24.244477 26.240169 22.317168 24.292354 25.157052 21.392938
50% 23.286221 28.404056 25.972840 23.568937 26.705618 28.368931 22.750731 22.405116 22.260924 23.342752 ... 22.955144 22.188302 22.566171 28.330765 24.752862 26.644405 22.889030 24.837943 25.533329 22.455256
75% 23.479114 29.141976 26.977802 23.819750 27.171606 28.939251 23.155947 23.501777 23.464925 23.481043 ... 23.662340 23.125324 23.103945 28.546374 25.067533 27.341107 23.248793 25.396285 26.217616 23.167254
max 23.867311 30.812377 27.468243 24.479142 28.160532 30.258672 24.420405 24.034814 25.691071 27.025171 ... 25.020940 23.445817 23.364528 29.240840 25.761494 27.822988 24.681853 26.156805 26.981961 23.921616

8 rows × 1167 columns

[30]:
aux = processed_data.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_boxplot_grid(aux, identifier='boxplot', args={"Title":"Boxplot", 'x':'sample', 'y':'value', 'color':'group', 'axis':'cols', 'height':200})#
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_boxplot', plot_format='svg', directory=analysis_dir)
IN
[31]:
aux = processed_data.sort_values(by='sample').set_index(['group', 'sample']).stack().reset_index()
aux.columns = ['group', 'sample', 'identifier', 'value']
plot = viz.get_histogram(aux, identifier='histogram', args={'x':'value', 'color':'group', 'facet_row':'sample', 'title':'Facet Grid Plot'})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_distplot', plot_format='svg', directory=analysis_dir)

Mapping Identifiers to Names

CKG graph database can be used easily to map UniProt identifiers to protein names that are more commonly used and understood. We build unique identifiers by combining the UniProt ids with the names: name_id

[32]:
mapping_dict = mapping.getMappingFromDatabase(processed_data, node='Protein', attribute_from='id', attribute_to='name')
[33]:
processed_data.columns = [str(mapping_dict[c])+"_"+c if c in mapping_dict else c for c in processed_data.columns]
[34]:
processed_data.head()
[34]:
group sample PGP_A6NDG6 MFAP5_B3KW70 NACA_E9PAV3 TNS1_E9PGF5 CLEC3B_E9PHK0 KRT17_F5GWP8 None_F8W031 G3V3G9 ... LSM5_Q9Y4Y9 CERT1_Q9Y5P4 SNX5_Q9Y5X3 HEBP2_Q9Y5Z4 CSAD_Q9Y600 PSAT1_Q9Y617 CPQ_Q9Y646 COPG1_Q9Y678 CLIC4_Q9Y696 EPN1_Q9Y6I3
0 BAT_NE BAT_NE1 22.948430 29.091339 26.578739 23.563604 27.322081 27.821164 24.420405 23.947173 ... 22.109509 23.130157 22.569201 28.647187 24.888148 26.012226 22.746304 23.610317 24.676053 23.408469
1 BAT_NE BAT_NE2 22.954981 28.610591 27.468243 23.811590 26.283462 27.452462 23.071490 23.891354 ... 22.109780 23.312664 22.572965 28.026016 24.164153 26.294796 22.171377 25.493292 24.928906 22.570449
2 BAT_NE BAT_NE3 22.817377 28.272086 26.773793 23.708936 25.299149 27.471115 22.773441 23.618293 ... 22.104088 23.378619 22.493896 27.956985 25.221587 26.440457 22.150598 25.144410 26.972790 22.340063
3 BAT_NE BAT_NE4 22.937884 30.812377 26.089441 23.757767 28.160532 28.601127 23.301111 23.881632 ... 22.109072 22.237510 22.563140 27.506049 24.793329 24.906694 22.693322 23.264133 25.210342 22.709767
4 BAT_NE BAT_NE5 23.047168 28.364277 26.971255 24.055731 25.989885 26.889135 22.317248 24.034814 ... 22.113593 23.397543 22.625936 28.196715 25.171234 26.397037 21.778724 24.573057 26.251168 22.132167

5 rows × 1169 columns

[35]:
processed_data.to_csv('{}/processed_data_matrix.tsv'.format(analysis_dir), sep='\t', index=False)

Principal Component Analysis

[36]:
pca_result, args = analytics.run_pca(processed_data, drop_cols=['sample'], group='group')
args.update({"loadings":10, "title":'PCA plot', 'x_title':'PC1', 'y_title':'PC2', 'height':600, 'width':700})
[37]:
plot = viz.get_pca_plot(pca_result, identifier='pca', args=args)
iplot(plot.figure)
viz.save_DASH_plot(plot, name='pca', plot_format='svg', directory=analysis_dir)

Secretome Human Protein Atlas

This project requires a filtering based on known/predicted secreted proteins (secretome). Currently, we don’t have this information in CKG’s graph database but we can use functionality in graphdb_builder to download any database (file) from an URL.

[38]:
secretome_url = 'https://www.proteinatlas.org/search/sa_location%3ASecreted+-+unknown+location%2CSecreted+in+brain%2CSecreted+in+female+reproductive+system%2CSecreted+in+male+reproductive+system%2CSecreted+in+other+tissues%2CSecreted+to+blood%2CSecreted+to+digestive+system%2CSecreted+to+extracellular+matrix%2CIntracellular+and+membrane?format=tsv'
[39]:
builder_utils.downloadDB(databaseURL=secretome_url, directory=analysis_dir, file_name='secretome.tsv', avoid_wget=True)
[40]:
secretome = pd.read_csv(os.path.join(analysis_dir, 'secretome.tsv'), sep='\t')
[41]:
secretome.head()
[41]:
Gene Gene synonym Ensembl Gene description Uniprot Chromosome Position Protein class Biological process Molecular function ... Single Cell Type RNA - Rod photoreceptor cells [NX] Single Cell Type RNA - Sertoli cells [NX] Single Cell Type RNA - Smooth muscle cells [NX] Single Cell Type RNA - Spermatocytes [NX] Single Cell Type RNA - Spermatogonia [NX] Single Cell Type RNA - Suprabasal keratinocytes [NX] Single Cell Type RNA - Syncytiotrophoblasts [NX] Single Cell Type RNA - T-cells [NX] Single Cell Type RNA - Undifferentiated cells [NX] Single Cell Type RNA - Urothelial cells [NX]
0 A1BG NaN ENSG00000121410 Alpha-1-B glycoprotein P04217 19 58345178-58353499 Plasma proteins, Predicted intracellular prote... NaN NaN ... 5.2 25.8 52.4 1.4 5.1 3.1 1.6 34.8 0.1 0.6
1 A2M CPAMD5, FWP007, S863-7 ENSG00000175899 Alpha-2-macroglobulin P01023 12 9067664-9116229 Cancer-related genes, Candidate cardiovascular... NaN Protease inhibitor, Serine protease inhibitor ... 18.6 329.8 207.9 1.8 1.1 12.3 3.4 16.7 0.0 2.2
2 A2ML1 CPAMD9, FLJ25179, p170 ENSG00000166535 Alpha-2-macroglobulin like 1 A8K2U0 12 8822472-8887001 Predicted intracellular proteins, Predicted se... NaN Protease inhibitor, Serine protease inhibitor ... 0.2 0.0 0.3 1.4 0.0 11.0 0.1 0.0 0.0 4.9
3 A4GNT alpha4GnT ENSG00000118017 Alpha-1,4-N-acetylglucosaminyltransferase Q9UNA3 3 138123718-138132387 Predicted intracellular proteins NaN Glycosyltransferase, Transferase ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 AADACL2 MGC72001 ENSG00000197953 Arylacetamide deacetylase like 2 Q6P093 3 151733916-151761339 Predicted secreted proteins NaN Hydrolase ... 0.0 0.0 0.0 0.0 0.0 26.8 0.1 0.0 0.0 0.0

5 rows × 292 columns

[42]:
secretome.shape
[42]:
(2641, 292)
[43]:
mapping_dict = mapping.getMappingFromDatabase(secretome['Uniprot'].tolist(), node='Protein', attribute_from='id', attribute_to='name')
[44]:
secretome['identifier'] = [str(mapping_dict[p])+"_"+p if p in mapping_dict else p for p in secretome['Uniprot'].tolist()]
[45]:
secretome.head()
[45]:
Gene Gene synonym Ensembl Gene description Uniprot Chromosome Position Protein class Biological process Molecular function ... Single Cell Type RNA - Sertoli cells [NX] Single Cell Type RNA - Smooth muscle cells [NX] Single Cell Type RNA - Spermatocytes [NX] Single Cell Type RNA - Spermatogonia [NX] Single Cell Type RNA - Suprabasal keratinocytes [NX] Single Cell Type RNA - Syncytiotrophoblasts [NX] Single Cell Type RNA - T-cells [NX] Single Cell Type RNA - Undifferentiated cells [NX] Single Cell Type RNA - Urothelial cells [NX] identifier
0 A1BG NaN ENSG00000121410 Alpha-1-B glycoprotein P04217 19 58345178-58353499 Plasma proteins, Predicted intracellular prote... NaN NaN ... 25.8 52.4 1.4 5.1 3.1 1.6 34.8 0.1 0.6 A1BG_P04217
1 A2M CPAMD5, FWP007, S863-7 ENSG00000175899 Alpha-2-macroglobulin P01023 12 9067664-9116229 Cancer-related genes, Candidate cardiovascular... NaN Protease inhibitor, Serine protease inhibitor ... 329.8 207.9 1.8 1.1 12.3 3.4 16.7 0.0 2.2 A2M_P01023
2 A2ML1 CPAMD9, FLJ25179, p170 ENSG00000166535 Alpha-2-macroglobulin like 1 A8K2U0 12 8822472-8887001 Predicted intracellular proteins, Predicted se... NaN Protease inhibitor, Serine protease inhibitor ... 0.0 0.3 1.4 0.0 11.0 0.1 0.0 0.0 4.9 A2ML1_A8K2U0
3 A4GNT alpha4GnT ENSG00000118017 Alpha-1,4-N-acetylglucosaminyltransferase Q9UNA3 3 138123718-138132387 Predicted intracellular proteins NaN Glycosyltransferase, Transferase ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A4GNT_Q9UNA3
4 AADACL2 MGC72001 ENSG00000197953 Arylacetamide deacetylase like 2 Q6P093 3 151733916-151761339 Predicted secreted proteins NaN Hydrolase ... 0.0 0.0 0.0 0.0 26.8 0.1 0.0 0.0 0.0 AADACL2_Q6P093

5 rows × 293 columns

[46]:
secretome_cols = [c for c in processed_data.columns if c in secretome['identifier'].tolist()]
[47]:
sprocessed_data = processed_data[secretome_cols +['sample', 'group']]
[48]:
sprocessed_data.head()
[48]:
DNASE2_O00115 QSOX1_O00391 ACACB_O00763 ISLR_O14498 HSPB6_O14558 NRP1_O14786 CALU_O43852 CUTA_O60888 SLIT3_O75094 CILP_O75339 ... DPP7_Q9UHL4 CPA4_Q9UI42 PCOLCE2_Q9UKZ9 EPDR1_Q9UM22 ITM2B_Q9Y287 LOXL2_Q9Y4K0 CSAD_Q9Y600 CPQ_Q9Y646 sample group
0 24.114083 24.138056 30.128417 23.468740 29.996177 22.214779 31.531017 24.544570 22.014731 23.388637 ... 20.294032 23.102487 22.018794 23.550629 24.847902 23.296214 24.888148 22.746304 BAT_NE1 BAT_NE
1 24.119415 25.387356 30.391512 23.489224 29.513447 23.831518 30.589780 24.264634 21.450402 26.813356 ... 21.067531 21.805934 24.547985 22.870285 24.352738 23.289641 24.164153 22.171377 BAT_NE2 BAT_NE
2 24.161902 23.798872 29.734602 23.429947 29.180208 22.385902 30.160027 24.237612 27.257009 23.044385 ... 20.676194 21.570111 23.086508 22.662654 24.072838 22.711249 25.221587 22.150598 BAT_NE3 BAT_NE
3 23.993172 24.679493 28.181251 23.892080 28.725074 23.909215 30.443124 24.276151 21.425337 29.299769 ... 21.066661 21.807458 24.590637 23.047072 24.424457 24.206725 24.793329 22.693322 BAT_NE4 BAT_NE
4 24.193250 23.775452 30.909635 22.821237 29.365446 22.607023 30.262149 23.939482 21.384819 26.076288 ... 20.465266 22.136305 23.911137 22.389163 24.190087 23.140772 25.171234 21.778724 BAT_NE5 BAT_NE

5 rows × 243 columns

[49]:
sprocessed_data.to_csv('{}/secreted_processed_data.tsv'.format(analysis_dir), sep='\t', index=False)

Differential Regulation

[50]:
alpha = 0.05
fc = 2
aov_result = analytics.run_anova(sprocessed_data, alpha=alpha, drop_cols=['sample'], subject=None, group='group', permutations=0, correction='fdr_bh', is_logged=True)
[51]:
aov_result.head()
[51]:
identifier group1 group2 mean(group1) std(group1) mean(group2) std(group2) posthoc Paired posthoc Parametric posthoc T-Statistics ... FC efftype F-statistics pvalue padj correction rejected -log10 pvalue Method posthoc padj
0 A1BG_P04217 BAT_NE BAT_woNE 22.431113 1.053262 21.159121 0.166747 False True 2.667213 ... 2.414949 hedges 5.335302 0.009697 0.024255 FDR correction BH True 1.545383 One-way anova 0.177929
1 A1BG_P04217 BAT_NE WAT_NE 22.431113 1.053262 21.424311 0.425668 False True 1.981715 ... 2.009452 hedges 5.335302 0.009697 0.024255 FDR correction BH True 1.081820 One-way anova 0.207934
2 A1BG_P04217 BAT_NE WAT_woNE 22.431113 1.053262 22.382288 0.529607 False True 0.092607 ... 1.034422 hedges 5.335302 0.009697 0.024255 FDR correction BH True 0.032221 One-way anova 0.960372
3 A1BG_P04217 BAT_woNE WAT_NE 21.159121 0.166747 21.424311 0.425668 False True -1.297094 ... 0.832089 hedges 5.335302 0.009697 0.024255 FDR correction BH True 0.636851 One-way anova 0.394409
4 A1BG_P04217 BAT_woNE WAT_woNE 21.159121 0.166747 22.382288 0.529607 False True -4.925975 ... 0.428341 hedges 5.335302 0.009697 0.024255 FDR correction BH True 2.937312 One-way anova 0.009281

5 rows × 26 columns

[52]:
aov_result.loc[aov_result.rejected].identifier.unique().shape
[52]:
(122,)
[53]:
aov_result.to_csv('{}/anova_results.tsv'.format(analysis_dir), sep='\t', index=False)
[54]:
args={'alpha':alpha,
      'fc':fc,
      'colorscale':'Blues',
      'showscale': False,
      'marker_size':10,
      'num_annotations':480,
      'x_title':'log2FC',
      'y_title':'-log10(pvalue)'}
figures = viz.run_volcano(aov_result, identifier='volcano', args=args)
i = 1
for figure in figures:
    if figure.figure['layout'].title['text'] == 'Comparison: BAT_woNE vs WAT_woNE':
        iplot(figure.figure)
        i += 1
        viz.save_DASH_plot(figure, name='BAT_woNE_vs_WAT_woNE_volcano_plot', plot_format='svg', directory=analysis_dir)

Analysis of Upregulated Proteins in Unestimulated BAT va WAT

[55]:
upregulated_proteins = aov_result[(aov_result['group1'] == 'BAT_woNE') &
                                  (aov_result['group2'] == 'WAT_woNE') &
                                  (aov_result['posthoc padj'] <= alpha) &
                                  (aov_result['log2FC'] >=1)]
[56]:
selected_df = df.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], [p.split('_')[1] if len(p.split('_')) > 1 else p for p in upregulated_proteins.identifier.unique().tolist()]].reset_index()
selected_df.groupby("group").mean()

[56]:
identifier P02749 P61769 Q07021 P13671 P12532 P10909 P53634 Q9UBR2 O00115 P26885 ... P13667 Q13162 P02753 P34096 Q99584 O95969 P31947 P10124 Q99727 Q13263
group
BAT_woNE 25.182105 30.143430 26.047659 24.232947 28.330328 27.380916 24.416099 25.82257 23.386786 23.663076 ... 25.794822 23.634525 29.731995 26.620535 26.522128 24.171815 27.728345 24.140254 24.868147 24.125775
WAT_woNE 24.559642 28.309072 24.776456 NaN NaN 25.401467 NaN NaN NaN NaN ... 25.519324 NaN 28.364350 26.031432 25.062536 NaN NaN NaN NaN NaN

2 rows × 30 columns

[57]:
plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Upregulated Proteins (original data)", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)
[58]:
selected_df = processed_data.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], upregulated_proteins.identifier.unique().tolist()].reset_index()
selected_df.groupby("group").mean()
[58]:
APOH_P02749 B2M_P61769 C1QBP_Q07021 C6_P13671 CKMT1A_P12532 CLU_P10909 CTSC_P53634 CTSZ_Q9UBR2 DNASE2_O00115 FKBP2_P26885 ... PDIA4_P13667 PRDX4_Q13162 RBP4_P02753 RNASE4_P34096 S100A13_Q99584 SCGB1D2_O95969 SFN_P31947 SRGN_P10124 TIMP4_Q99727 TRIM28_Q13263
group
BAT_woNE 25.182105 30.143430 26.047659 24.222532 28.330328 27.380916 24.421102 25.822570 23.376689 23.665083 ... 25.794822 23.633922 29.731995 26.620535 26.522128 24.171815 27.728345 24.16699 24.899663 24.149016
WAT_woNE 23.212651 28.309072 22.956111 22.213605 22.204772 25.334937 23.257274 22.665958 22.361763 22.391318 ... 23.120770 22.283522 28.364350 23.248443 25.036334 22.413116 22.685906 21.99328 22.562938 22.603027

2 rows × 30 columns

[59]:
plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Upregulated Proteins", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

Analysis Downregulated Proteins in Unestimulated BAT va WAT

[60]:
downregulated_proteins = aov_result[(aov_result['group1'] == 'BAT_woNE') &
                                  (aov_result['group2'] == 'WAT_woNE') &
                                  (aov_result['posthoc padj'] <= alpha) &
                                  (aov_result['log2FC'] <= -1)]
[61]:
selected_df = df.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], [p.split('_')[1] if len(p.split('_')) > 1 else p for p in downregulated_proteins.identifier.unique().tolist()]].reset_index()
selected_df.groupby("group").mean()

[61]:
identifier P04217 P39059 P39060 P12111 Q9Y646 P10619 P07711 Q07507 O94769 Q92520 ... Q8NHP8 P42785 Q9BRX8 P50454 Q9HAT2 P78539 P10646 P02786 Q15582 P24821
group
BAT_woNE NaN 23.520278 NaN 30.474014 22.265727 23.106582 24.691102 24.647780 NaN NaN ... 22.702228 24.413882 21.646408 26.485985 23.519893 25.031933 22.578169 21.273611 NaN 22.252515
WAT_woNE NaN 24.732646 23.784856 31.820459 23.504914 24.305533 25.785085 25.878753 NaN NaN ... 24.245612 25.850791 23.023065 27.525384 23.439551 26.176678 NaN NaN 24.882087 24.936074

2 rows × 30 columns

[62]:
plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Downregulated Proteins (original data)", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='original_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)
[63]:
selected_df = processed_data.set_index("group").loc[['BAT_woNE', 'WAT_woNE'], downregulated_proteins.identifier.unique().tolist()]
selected_df.groupby("group").mean()
[63]:
A1BG_P04217 COL15A1_P39059 COL18A1_P39060 COL6A3_P12111 CPQ_Q9Y646 CTSA_P10619 CTSL_P07711 DPT_Q07507 ECM2_O94769 FAM3C_Q92520 ... PLBD2_Q8NHP8 PRCP_P42785 PRXL2A_Q9BRX8 SERPINH1_P50454 SIAE_Q9HAT2 SRPX_P78539 TFPI_P10646 TFRC_P02786 TGFBI_Q15582 TNC_P24821
group
BAT_woNE 21.159121 23.515881 20.742757 30.474014 22.284273 23.135425 24.691102 24.652254 21.251165 20.547066 ... 21.122663 24.413882 21.406060 26.462812 21.796696 25.050929 21.369273 21.272058 21.429750 21.270361
WAT_woNE 22.382288 24.767523 23.791789 31.820459 23.447894 24.320570 25.777819 25.803667 23.080337 22.831844 ... 23.125767 25.880017 22.730218 27.525384 23.460160 26.176678 23.005681 22.629342 24.927593 25.012742

2 rows × 30 columns

[64]:
plot = viz.get_parallel_plot(selected_df,
                      identifier='parallel_plot',
                      args={"title": "Parallel Plot Downregulated Proteins", "zscore":False, "group":"group"})
iplot(plot.figure)
viz.save_DASH_plot(plot, name='processed_data_parallel_plot_upregulated_proteins', plot_format='svg', directory=analysis_dir)

Functional Enrichment – GO Biological Processes

[65]:
annotation_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(bp:Biological_process) WHERE (p.name +"_"+p.id) IN [{}] RETURN DISTINCT (p.name +"_"+p.id) AS identifier, bp.name AS annotation'
prots = ["'{}'".format(p) for p in processed_data.drop(['sample', 'group'], axis=1).columns.unique().tolist()]
[66]:
driver = connector.getGraphDatabaseConnectionConfiguration()
annotation = connector.getCursorData(driver, annotation_query.format(",".join(prots)))
[67]:
annotation.head()
[67]:
annotation identifier
0 mitochondrial genome maintenance TYMP_P19971
1 regulation of DNA recombination ALYREF_Q86V81
2 maltose metabolic process GAA_P10253
3 ribosomal large subunit assembly RPL11_P62913
4 ribosomal large subunit assembly RPL5_P46777
[68]:
enrichment = analytics.run_up_down_regulation_enrichment(aov_result, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


[69]:
figures = viz.get_enrichment_plots(enrichment, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory='.')
    i += 1

Generating Knowledge Graph

All the connections stored in CKG’s knowledge graph can be used to annotate the list of differentially regulated proteins. These annotations based on associations to Biological processes, diseases, tissues, protein complexes or drugs can be used to build a knowledge subgraph specific for this study.

[70]:
project_knowledge = knowledge.Knowledge(identifier='Batokines',
                              data=None,
                              nodes={},
                              relationships={},
                              queries_file=None,
                              colors={},
                              graph=None,
                              report={})

[71]:
regulation = upregulated_proteins[['identifier', 'log2FC']]
regulation['reg'] = 'Upregulated'
[72]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=regulation,
                                                   entity1='upregulated',
                                                   entity2='Protein',
                                                   source='reg',
                                                   target='identifier',
                                                   rtype='is_upregulated',
                                                   weight='log2FC')
[73]:
regulation = downregulated_proteins[['identifier', 'log2FC']]
regulation['log2FC'] = np.abs(regulation['log2FC'])
regulation['reg'] = 'Downregulated'
[74]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=regulation,
                                                   entity1='downregulated',
                                                   entity2='Protein',
                                                   source='reg',
                                                   target='identifier',
                                                   rtype='is_downregulated',
                                                   weight='log2FC')
[75]:
nodes, rels = project_knowledge.generate_knowledge_from_enrichment(data=enrichment, name="Biologicalprocesses")
project_knowledge.update_nodes(nodes)
project_knowledge.update_relationships(rels)

Disease Associations

[76]:
disease_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(d:Disease)-[:HAS_PARENT*2..4]->(pd:Disease) WHERE toFloat(r.score)>1.5 AND (p.name+"_"+p.id) IN [{}] AND pd.name="disease of metabolism" RETURN (p.name+"_"+p.id) AS protein, d.name AS disease, d.id AS disease_id, r.score AS score'
regulated_prots = ["'{}'".format(p) for p in upregulated_proteins.identifier.unique().tolist() + downregulated_proteins.identifier.unique().tolist()]
associations = connector.getCursorData(driver, disease_query.format(",".join(regulated_prots)))
diseases = associations['disease_id'].unique().tolist()
[77]:
associations.head()
[77]:
disease disease_id protein score
0 amyloidosis DOID:9120 ITM2B_Q9Y287 4.543
1 amyloidosis DOID:9120 IDE_P14735 2.600
2 amyloidosis DOID:9120 GRN_P28799 1.597
3 amyloidosis DOID:9120 B2M_P61769 4.647
4 amyloidosis DOID:9120 MFGE8_Q08431 4.314
[78]:
net = viz.get_network(associations,
                      identifier='disease_network',
                      args={'source': 'protein',
                            'target': 'disease',
                            'values': 'score',
                            'node_size': 'degree',
                            'title': 'Disease Network',
                            'color_weight': False})
viz.visualize_notebook_network(net['notebook'])
[79]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Disease',
                                                   source='protein',
                                                   target='disease',
                                                   rtype='associated_with',
                                                   weight='score')


Tissue Associations

Brown Adipose Tissue

[80]:
tissue_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(t:Tissue) WHERE t.id="BTO:0000156" AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, t.name AS tissue, r.score AS score'

associations = connector.getCursorData(driver, tissue_query.format(",".join(regulated_prots)))

associations

[81]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Tissue',
                                                   source='protein',
                                                   target='tissue',
                                                   rtype='associated_with',
                                                   weight='score')


White Adipose Tissue

[82]:
tissue_query = 'MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(t:Tissue) WHERE t.id="BTO:0001456" AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, t.name AS tissue, r.score AS score'

associations = connector.getCursorData(driver, tissue_query.format(",".join(regulated_prots)))
[83]:
associations.head()
[83]:
protein score tissue
0 COL15A1_P39059 0.725 white adipose tissue
1 B2M_P61769 1.609 white adipose tissue
2 SERPINH1_P50454 1.049 white adipose tissue
3 COL6A3_P12111 1.650 white adipose tissue
4 CTSL_P07711 1.310 white adipose tissue
[84]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Tissue',
                                                   source='protein',
                                                   target='tissue',
                                                   rtype='associated_with',
                                                   weight='score')


Protein Complexes

[85]:
complexes_query = 'MATCH (p:Protein)-[r:IS_SUBUNIT_OF]->(c:Complex) WHERE (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, c.name AS complex, r.score AS score'

associations = connector.getCursorData(driver, complexes_query.format(",".join(regulated_prots)))
associations['score'] = 0.0
[86]:
associations.head()
[86]:
complex protein score
0 NCOR1 complex TRIM28_Q13263 0.0
1 HDAC10-PAX3-KAP1 complex TRIM28_Q13263 0.0
2 ITGA9-ITGB1-TNC complex TNC_P24821 0.0
3 Emerin complex 24 C1QBP_Q07021 0.0
4 TF-TFRC complex TFRC_P02786 0.0
[87]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Complex',
                                                   source='protein',
                                                   target='complex',
                                                   rtype='is_subunit_of',
                                                   weight='score')


Drug Associations

[88]:
drugs_query = 'MATCH (d:Drug)-[r:ACTS_ON]->(p:Protein) WHERE toFloat(r.score)>0.7 AND (p.name+"_"+p.id) IN [{}] RETURN (p.name+"_"+p.id) AS protein, d.name AS drug, r.score AS score'
associations = connector.getCursorData(driver, drugs_query.format(",".join(regulated_prots)))
[89]:
associations.head()
[89]:
drug protein score
0 Formic acid TFRC_P02786 0.800
1 Magnesium cation B2M_P61769 0.958
2 Magnesium cation HEXB_P07686 0.836
3 Magnesium cation HEXA_P06865 0.836
4 Magnesium cation KARS1_Q15046 0.800
[90]:
project_knowledge.generate_knowledge_from_edgelist(edgelist=associations,
                                                   entity1='Protein',
                                                   entity2='Drug',
                                                   source='protein',
                                                   target='drug',
                                                   rtype='acts_on',
                                                   weight='score')


Visualize the Knowledge Graph

[91]:
project_knowledge.generate_report(visualizations=['sankey'], summarize=True)
project_knowledge.report.visualize_report(environment='notebook')
[91]:
[]
[92]:
project_knowledge.generate_report(visualizations=['network'], summarize=True)
project_knowledge.report.visualize_report(environment='notebook')[0]